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Abstract 



o 

^ ■ A fast multipole method (FMM) for asymptotically smooth kernel functions (Yr, i/r*", Gauss and 

h-^ ■ Stokes kernels, radial basis functions, etc.) based on a Chebyshev interpolation scheme has been in- 

, troduced in 0]. The method has been extended to oscillatory kernels (e.g., Helmholtz kernel) in [l^ . 

■ Beside its generality this FMM turns out to be favorable due to its easy implementation and its high 
^Nj ' performance based on intensive use of highly optimized BLAS libraries. However, one of its bottlenecks 

is the precomputation of the multiple-to-local (M2L) operator, and its higher number of floating point 
operations (flops) compared to other FMM formulations. Here, we present several optimizations for that 
operator, which is known to be the costliest FMM operator. The most efficient ones do not only reduce 

■ the precomputation time by a factor up to 340 but they also speed up the matrix- vector product. We 
^ ' conclude with comparisons and numerical validations of all presented optimizations. 



1 Introduction 



The fast multipole method (FMM) is a method first designed in fi\ to reduce the cost of matrix-vector 
Q>^ I products from 0{N'^) to 0{N) or 0{N log N) depending on the underl ying kernel function. Most FMM 

■ variants have been developed and optimized for specific kernel functions (l3l Isl. [sl [ij. However, some have 
C I a formulation that is independent of the kernel function [13, [13, S S • The paper at hand addresses the 

optimization of one of these formulations, the so called black-box FMM (bbFMM), presented in [5j]. It is 
based on the approximation of the kernel function via a Chebyshev interpolation and is a black-box scheme 

■ for kernel functions that are asymptotically smooth, e.g., l/(r^ -|-c^)"/^ with r = \x — y\, c a given parameter 
and neN. The bbFMM has been extended to the directional FMM (dFMM) for oscillatory kernels in 12]. 
It is suitable for any kernel function of the type g{r)e^^'^ where g{r) is an asymptotically smooth function 
(z^ = -1 is the imaginary unit and k the wavenumber). 

The main idea of the FMM is to properly separate near-field {\x — y| -> 0) and far-field {\x — y| — > oo). 
\ The near-field is evaluated directly and the far-field can be approximated and thus computed efficiently. In 

this paper, we will denote M2L the operator that transforms a multipole expansion into a local expansion. 
The M2L operator is the costliest step in the method, since it needs to be applied to all clusters in the 
interaction list, that is 189 times for each cluster for a typical configuration in bbFMM. In this paper we 
focus on various optimizations of this operator for both the bbFMM and the dFMM. 

First, we address the optimization proposed in [1, In that paper, the M2L operator between a cluster 
and its interaction list is further compressed using a singular value decomposition (SVD). The key idea is that 
the SVD provides the smallest possible rank given an error e (therefore leading to the smallest computational 
cost). In |5j], a single SVD is used to compress all the M2L operators at a given level using a single 
singular vector basis. However, the singular values of individual M2L operators decay at different rates; it 
is often the case for example that cluster pairs that are separated by a small distance have higher rank than 
pairs that are separated by a greater distance. If we denote w the diameter (or width) of a cluster and d 
the distance between two clusters, then the smallest distance corresponds to dmin = 2?/; while the largest is 
c^max = 3\/3w;. The ratio dmax/'^min = 3\/3/2 is in fact relatively large. This can be taken advantage of to 
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further compress the M2L operator on a cluster-pair basis leading to an even smaller computational cost 
than the original algorithm [5|. This point is investigated in details in this paper. 

Another bottleneck in the FMM of Q is the precomputation time of the M2L operators. We therefore 
introduce a new set of optimizations that exploit symmetries in the arrangement of the M2L operators. For 
example, for bbFMM, this allows us to express all M2L operators (316 = 7^ — 3'^) as permutations of a subset 
with only 16 unique M2L operators. These permutations correspond to reflections across various planes (for 
example a; = 0, y = 0, z = 0, etc). Table [5] for example reports reductions in precomputation time by a 
factor of about 50-200x. 

Besides drastically reducing precomputation time and memory requirement this approach paves also the 
road for various algorithmic improvements. Modern processors use memory cache to enhance performance 
and, as a result, schemes that block data (thereby reducing the amount of data traffic between main memory 
and the floating point units) can result in much higher performance. In our case, the use of symmetries 
allows calling highly optimized matrix-matrix product implementations (instead of matrix- vector) that have 
better cache reuse. Overall, this may result in acceleration by a factor 4-6x for smooth kernels (Tables |BH51) . 
We also present results using oscillatory kernels of the type mentioned above {g{r)e^^'^). In that case, the 
acceleration is more modest and ranges from about 1.2 to 2.7x (Table [5HTT|). 

In this paper we therefore focus both on performance and precomputation time. Both are important 
factors for the FMM: the precomputation is crucial if we are interested in only one matrix-vector product, 
while in other cases, such as the solution of a linear system, the fast application of matrix-vector products 
is central. 

The paper is organized as follows. In Sec. [U we briefly recall the bbFMM and the dFMM and introduce 
notations that are needed for explanations later in this paper. In Sec. [31 we address the separation of near- 
and far-field, introduce the notion of transfer vector to uniquely identify M2L operators, and describe how 
the kernel (smooth or oscillatory) affects the interaction list. We start Sec. H] with a brief recall of the known 
optimization of the M2L operator (see [1, [l3|) and suggest further improvements. Then, we present a new 
set of optimizations and explain them in details. These include using a low-rank approximation of M2L for 
individual interactions (individual cluster pairs), along with the definition of symmetries and permutations 
and how they are used to reduce the computational cost. Finally, in Sec. [5l we present numerical results. 
We compare all variants for bbFMM and dFMM with three datasets corresponding to a sphere, an oblate 
spheroid, and a prolate spheroid. The measured running time of the FMM, along with the precomputation 
time, and its accuracy as a function of various parameters are reported. The efficiency of the algorithms 
as a function of the target accuracy is considered. We also performed tests using two different kinds of 
linear algebra libraries for the matrix-matrix products and other operations (an implementation of BLAS 
and LAPACK vs the Intel MKL library). 

2 Pairwise particle interactions 

The problem statement reads as follows. Assume, the cells X and Y contain source {j/jl^Li and target 
particles {xi]fLi^ respectively. Compute the interactions 



The kernel function K{x, y) describes the influence of the source particles onto the target particles. The cost 
of directly evaluating the summation in Eqn. ([T]) grows like 0{MN) which becomes prohibitive as M, N ^ oo 
and it is why we need a fast summation scheme. 

2.1 Fast summation schemes based on Chebyshev interpolation 

For a detailed derivation and error analysis of the FMM based on Chebyshev interpolation we refer the reader 
to [l2| • We adapt most of their notations and repeat only concepts which are necessary to understand 
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explanations hereafter. 

Let the function / : M'^ — > C be approximated by a Chebyshev interpolation scheme as 

f{x)^ ^ Si{x,Xa)f{Xa) (2) 
\a\<3l 

with the 3-dimensional multi-index a = (ai,a2,Q^3) and \a\ = max(ai, 02, as) with € (1,...,^). The 
interpolation points = (xq,j , Xq-j , x^a) are given by the tensor-product of the Chebyshev roots Xa- of the 
Chebyshev polynomial of first kind Ti{x) = cos(arccosx) with x E [—1,1]. The interpolation operator reads 
as 

Sl{x,Xa) = Si{xi,Xai) Sl{x2,Xa2) Siix^^Xa^). (3) 

For the interpolation on arbitrary intervals, we need the affine mapping $ : [—1,1] — > [a,b]. We omit it 
hereafter for the sake of readability. 

2.1.1 Black box FMM (bbFMM) 

If two cells X and Y are well separated, we know from [5] that asymptotically smooth kernel functions can 
be interpolated as 

K(x,y) - ^ Si{x,Xa) ^ K{xa,yp) Si{y,yp). (4) 

\a\<i \l3\<i 

We insert the above approximation into Eqn. ^ and obtain 

N 

fi-^ ^ Si{x^,Xa) ^ K(xa,yi3)^Si{yj,y^) Wj (5) 

\a\<e \i3\<e j=i 

which we split up in a three-stage fast summation scheme. 

1. Particle to moment (P2M) or moment to moment (M2M) operator: equivalent source values are 
anterpolated at the interpolation points yp Cz Y hy 

N 

2. Moment to local operator (M2L): target values are evaluated at the interpolation points Xa & X hy 

= H for \a\<L (7) 

\P\<e 

3. Local to local (L2L) or local to particle (L2P) operator: target values are interpolated at final points 
Xi e X hy 

/, ~ ^ 5(a:„x„)F„ for i = l,...,M. (8) 

\a\<e 

Recall, the cells X and Y are well separated and thus all contributions of fi can be computed via the above 
presented fast summation scheme (no direct summation is necessary). 
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2.1.2 Directional FMM (dFMM) 

Whenever we deal with oscillatory kernel functions, such as the Helmholtz kernel, the wavenumber k conies 
into play. Depending on the diameter of the cells X and Y and the wavenumber they are either in the 
low-frequency or in the high-frequency regime. In the low-frequency regime the fast summation schemes of 
the dFMM and the bbPMM are the same. In the high-frequency regime the fast summation scheme becomes 
directional. From fl2l| we know that any oscillatory kernel function K{x,y) = C?(a;, ?/)e''^l^~^l, where G{x,y) 
is an asymptotically smooth function, can be rewritten as 

= i^"(a;,y)e*'''"-(^"^) with y) G(x, 2/)e''=(l^-''l-"-(^-2'». (9) 

We assume that the cells X and Y of width w are centered at Cx and Cy and Cy lies in a cone of direction u 
being centered at Cx (think of the domain around X being virtually subdivided in cones given by directional 
unit vectors {ud^j^, where C is determined by their aperture). If the cell pair [X, Y) satisfies the separation 
criterion 0(kw'^) and the cone-aperture criterion ©(l/fcw), the error of the Chebyshev interpolation of the 
kernel function 

K^{x,y)^ J2 Seix,xa.) ^ K^{x^,yp) Se{y,yfi) (10) 

\a\<e \I3\<£ 

decays exponentially in the interpolation order i (independent of the wavenumber k; see 11, l2i)- We insert 
the above interpolated kernel function in Eqn. ^ and obtain 

N 

\a\<e \i3\<e j=i 

for all i = 1, . . . ,M. Similarly as with the bbFMM, a three-stage fast summation scheme for oscillatory 
kernels in the high-frequency regime can be constructed. 

1. Particle to moment (P2M) or moment to moment (M2M) operator: equivalent source values are 
anterpolated at the interpolation points y/3 £ Y hy 



N 



2. Moment to local operator (M2L): target values are evaluated at the interpolation points Xa E X hy 

F„"= ^ i^(x„,y^)W-^" for \a\ < L (13) 

l/3|<« 

3. Local to local (L2L) or local to particle (L2P) operator: target values are interpolated at final points 

Xi <E X hy 

/,~e^^"-"' ^ 5(x„x„)e-^^"-*°i^^ for z = l,...,M. (14) 

\a\<i 

Even though the bbPMM and the dFMM are here presented as single-level schemes, they are usually 
implemented as multilevel schemes. Strictly speaking, the steps one and three of both schemes are the P2M 
and L2P operators. Let us recall briefly on the directional M2M and L2L operators of dPMM: based on 
the criterion 0{l/kw), the aperture of the cones at the upper level is about half the aperture at the lower 
level. Due to a nested cone construction along octree levels, we preserve the accuracy of the Chebyshev 
interpolation within the multilevel scheme. For a detailed description of all operators we refer to 0, [l^). 
Note, the similarity of the M2L operators (step two of both schemes) of bbPMM and dPMM. In fact, the 
only difference in their implementation is that in the bbPMM case we have one loop over all cell pairs, 
whereas in the dPMM case we have two loops: the outer loop over all existing cones of direction {wcj^i 
and the inner loop over all cell pairs lying in the current cone. In this paper, we focus on the M2L operators 
and their efficient numerical treatment. 
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3 M2L operators 



The first step of any FMM consists in a proper separation of near- and far-field. After that, the near-field 
is evaluated directly and the far-field efficiently using a fast summation scheme. In this section, we focus 
on the first step. The union of near- and far-field of a target cell X is spatially restricted to the near-field 
of its parent cell. Algorithm [T] explains how these interactions are computed for dFMM . The recursive 
partitioning starts with the two root cells X and Y of the octrees for source and target particles. If a pair 
of cells satisfies the separation criterion in the high- or low-frequency regime, y is a far-field interaction of 
X. Else, if they are at the leaf level of the octree, F is a near-field interaction of X. If none is true, the cell 
is subdivided and the tests are repeated recursively. In line [3] in Alg.[T]we use the term directional far- field. 

Algorithm 1 Separate near- and far-field in the low- and high-frequency regime 
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function SeparateNearAndFarField(X, Y) 

if (AT, Y) are admissible in the high-frequency regime then 

add Y to the directional far-field of X return 
else if {X, Y) are admissible in the low-frequency regime then 

add Y to the far-field of X return 
else if {X, Y) are leaves then 

add Y to the near-field of X return 
else 

for all Xchiid G X and all ychiid G Y do 

SEPARATENEARANDFARFlELD(Archild, Ychiid) 

end for 
end if 
end function 



a concept explained in detail in 12]: In the high-frequency regime the far-field is subdivided into cones of 
direction u needed by the directional kernel function K'^(x, y). Each source cell Y is assigned to a cone and 
there are as many directional far-fields as there are cones. 

3.1 Transfer vectors 

In order to address interactions uniquely we introduce transfer vectors t — (^1,^2,^3) with t ^ 1? . They 
describe the relative positioning of two cells X and Y and are computed as t = ('^^~<^v)/w where Cx and Cy 
denote the centers of the cells X and Y and w their width. In the following we use transfer vectors to 
uniquely identify the M2L operator that computes the interaction between a target cell X and a source cell 
Y . In matrix notation an M2L operator reads as Kt of size x and the entries are computed as 

{^t)m(a)n(P) ^ K[Xa,Vl3) (15) 

with the interpolation points G AT and yp G Y . Bijective mappings 

m,n :{!,..., n X {!,..., n X {1, ...,£} ^ {1, (^g) 

with m^^ {m{a)) = a and n~^(n(/3)) = /3 provide unique indices to map back and forth between multi-indices 
of Xa and y/3 and rows and columns of Kj. We choose them to be m{a) — (ai — 1) + (a2 — 1)1+ [a^ — l)£^ + 1 
and same for n{f3). 

3.2 Interaction lists 

Very commonly fast multipole methods are used for translation invariant kernel functions K{x,y) — K{x + 
v,y + v) for any v E M.^. Because of that and because of the regular arrangement of interpolation points x 
and y in uniform octrees it is sufficient to identify unique transfer vectors at each level of the octree and to 
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compute the respective M2L operators. In the following we refer to such sets of unique transfer vectors as 
interaction lists T dl? . 

If we consider asymptotically smooth kernel functions the near-field is limited to transfer vectors satisfying 
1^1 < "v/S; it leads to 3'^ — 27 near-field interactions (see |5]). In a multi-level scheme, these 27 near-field 
interactions contain all 6'^ = 216 near- and far-field interactions of its eight child-cells. Far-field interactions 
are given by transfer- vectors that satisfy \t\ > ^/S. This leads to a maximum of 6'^ — 3^ = 189 interactions per 
cell and we end up with the usual overall complexity of 0{N) of fast multipole methods for asymptotically 
smooth kernel functions. The union of all possible far-field interactions of the eight child cells gives 7^ — 3"^ = 
316 interactions. That is also the largest possible number of different M2L operators have to be computed per 
octree level. Most asymptotically smooth kernel functions happen also to be homogeneous K{ar) — a^'-K{r) 
of degree n. In other words, if we scale the distance r = |a: — ?/| between source and target by a factor of a the 
resulting potential is scaled by a", where n is a constant that depends on the kernel function. The advantage 
of homogeneous kernel functions is that the M2L operators need to be precomputed only at one level and can 
simply be scaled and used on other levels. This affects the precomputation time and the required memory. 

If we consider oscillatory kernel functions we need to distinguish between the low- and high-frequency 
regime [l^ . The admissibility criteria in the low- frequency regime are the same as those for asymptotically 
smooth kernel functions. However, in the high-frequency regime the threshold distance between near-field 
and far-field is 0{kw^); nonetheless, as shown in [1^ we end up with the usual complexity of 0{N \ogN) 
of fast multipole methods for oscillatory kernel functions. It depends on the wavenumber k (a property 
of the kernel function). Thus, the size of near- and far- field is not a priori known as it is in the case of 
asymptotically smooth kernel functions. Table [T] summarizes the number of near and far-field interactions 
to be computed depending on different kernel functions. 

Table 1: Number of near- and far- field interactions depends on the kernel function 



type of kernel function 


cells in near-field 


cells in far-field 


smooth 

smooth and homogeneous 
oscillatory 


< 27 per leaf 

< 27 per leaf 
depends on k 


< 316 per level 
< 316 for all levels 
depends on k 



4 Optimizing the M2L operators 

In all fast multipole methods the M2L operator adds the largest contribution to the overall computational 
cost: for bbFMM it grows like 0{N) and for dFMM like O(A^logiV). In this section, we first briefly recaU 
the optimization that was used up to now and suggest an improvement. Then, we present a new set of 
optimizations that exploit the symmetries in the arrangement of the M2L operators. 

4.1 Single low-rank approximation (SA) 

In the following we explain the basics of the optimization used in [l^ and we refer to it as the SA variant 
hereafter. The idea is based on the fact that all M2L operators Kf with t G T can be assembled as a single 
big matrix in two ways: either as a row of matrices K^''°"^ — [Ki, . . . , Kt, . . . , K|j-|] or as a column of matrices 
K(coi) ^ . . . ; Kt; . . . ; K|T|] of M2L operators. The cardinality |r| gives the number of transfer vectors in 
the interaction list T. Next, both big matrices are compressed using truncated singular value decompositions 
(SVD) of accuracy e as 

K(™^) ~ UZV* and K*™^) - AfB* (17) 

with the unitary matrices U, B of size £ x r and V, A of size \T\£^ x r and the r singular values in Z, f. With 
a few algebraic transformations each M2L operator can be expressed as 

Kt ~ UQB* where = U*KfB of size r x r is computed as = ZV*B or Cf = U*Atr. (18) 
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The advantage of this representation is that the cost of applying the M2L operator gets reduced from 
applying a matrix of size i'^ x to a matrix of only r x r. Moreover, less memory is required. However, the 
precomputation time grows cubically with the accuracy of the method due to the complexity of the SVD. 



In [121 the SVD has been substituted by the adaptive cross approximation (ACA) followed by a truncated 
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SVD |1|. The precomputation time has been cut down drastically due to the linear complexity of the ACA. 



4.1.1 SA with recompression (SArcmp) 

If we use the S A variant the achieved low-rank r is the same for all M2L operators given by Ct . To a large 
extent, r is determined by the greatest individual low-rank of the M2L operators. This means that most of 
the matrices Ct of size r x r have effectively a smaller low-rank rt < r. We exploit this fact by individually 
approximating them as 

Ct ^ UtVj with Ut, Vt of size r x rt and the constraint < r/2. (19) 

Without the constraint the low-rank representation is less efficient than the original representation. The 
effects of the recompression are studied in Sec. 15.21 

4.2 Individual low-rank approximation (lA) 

As opposed to the SA approach, an individual low-rank approximation of the M2L operators as 

Kt - UtV* with UtVt of size £^ x n (20) 

directly leads to the optimal low-rank representation of each of them. As in the previous section, the 
approximation can be performed by either a truncated SVD or the ACA followed by a truncated SVD (note, 
the rank rt in the Eqns. pj)) and ((20|) might be similar but has not to be the same). All these variants (SA, 
SArcmp and lA) still require the approximation and storage of all M2L operators. In terms of time and 
memory however, it would be desirable to come up with a method that requires the approximation and the 
storage of a subset of operators only. Let us present a set of optimizations that fulfill these two requests. 



4.2.1 Symmetries and permutations 

Here, we illustrate how the full set of M2L operators can be expressed by a subset only. The idea is based on 
symmetries in the arrangement of M2L operators and exploits the uniform distribution of the interpolation 
points. We start by presenting the idea using a model example. After generalizing this idea, we demonstrate 
that M2L operators can be expressed as permutations of others. 



Model example The target cell X in Fig. [T] has three interactions Yt with the transfer vectors t e 
{(2, 1), (1, 2), (—2, 1)}. We choose the reference domain to be given by ti > t2 > 0. The goal is to express the 
M2L operators of all interactions via M2L operators of interactions that lie in the reference domain only. In 
our example this is the interaction with the transfer vector (2, 1). The two transfer vectors (1, 2) and (—2, 1) 
can be shown to be reflections along the lines given by ii = and ti ~ t2, respectively. We easily find that 
any t G 7? can be expressed as a reflection (or a combination of reflections) of transfer vectors that satisfy 
ii > t2 > 0. 

We claim that any reflection of a transfer vector corresponds to a permutation of the respective M2L 
operator. Recall that the evaluation of K(xa,yp) gives the entry from row m(a) and column n(/3) of the 
M2L operator. ^(2,1) is the only M2L operator whose transfer vector satisfies ti >t2 > 0. The multi-indices 
are constructed as presented in Fig.[T] As can be checked, the entry (K(2,i))mn is not the same as {^(i^2))mn 
or (K(-_2^i))mn- However, if we use the permuted multi-indices from Fig.[5a|for K(_2,i) or those from Fig. I2bl 
for K(]^ 2) they are the same. The logic behind this can be summarized as follows. 
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a = (ai, 02) 

/3=(/3i,/32) 



• 


• 


• 


(1,3) 


(2,3) 


(3.3) 


• 


• 


• 


(1,2) 


(2,2) 


(3,2) 


• 


• 


• 


(1,1) 


(2,1) 


(3.1) 



Y, 



(-2,1) 



a2 



• • • 

(1,3) (2.3) (3,3) 

(1,2) (2,2) (3,2) 

• • • 

(1.1) (2.1) (3,1) 




^(1,2) .| 
/ P2 


• • • 

(1,3) (2,3) (3,3) 

• • • 

(1,2) (2,2) (3.2) 

• • • 

(1,1) (2,1) (3,1) 



(1.3) (2^) ,(3,3) 

^A2gi,_ 

(1,2) .f2|2) (3,2) 



/l.l) (2,1) (3.1) 



(24) 



/3i 



X 



Figure 1: Axial and diagonal symmetries of interactions. The interpolation points Xa and j//3 are indexed by 
the multi-indices a and /3, respectively (interpolation order £ — 3). The only transfer vector that satisfies 
ti > t2 > is t = (2, 1). In that case, we claim that K(2.i) is the only M2L operator we need to compute. 
The direction of the arrows indicates the growth of the respective multi- index component. 



• If an axial symmetry is given by ii = as shown in Fig. I2a[ we invert the corresponding component of 
the multi-index as 

a^(£-(ai-l),a2) and /3 ^ (£ - - 1), /32). (21) 

• If the diagonal symmetry is given by ti = t2 as shown in Fig. l2b| we swap the corresponding components 
as 

a^{a2,ai) and /3 ^ (^2,/3i)- (22) 

Sometimes it is necessary to combine axial and diagonal permutations. Take as example the transfer vec- 
tor (—1,2): we need to flip it along ti = and then along ti = t2 to get (2,1). Note that reflections are 
non commutative, i.e., the order of their application matters. This is also true for permutations of the M2L 
operators. 



Generalization Let us extend the above concept to the three dimensional case. We start by introducing 
three axial and three diagonal symmetries in Z'^. 

• Axial symmetry planes are given by ti — 0, t2 = and = (see Fig. [3]). Each of the three planes 
divides in two parts, i.e., the negative part ti < and the positive part ti > 0. By combining all 
three planes Z^ is divided into octants. In the following we use Zj^, i.e., the octant with ti,t2,t^ > 
as reference octant. 

• Diagonal symmetry planes are given by ti — t2, ti = t^ and t2 — t^ (see Fig. 14]). In Z'^ there are six 
diagonal symmetries; however, we restrict ourselves to the symmetries affecting the reference octant 

By combining the three diagonal symmetries and the three axial symmetries we obtain the cone shown in 
Fig. [5l We refer to it as Zgy,„; it is given by 

Z3y„^ = {Z^y,^ C Z^ : ti > i2 > ^3 > with teZ^} . (23) 
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(-2,1) 





a - 






a-- 




• 

(3,3) 


• 

(2,3) 


• 

(1,3) 


• 

(3,2) 


• 

(2,2) 


• 

(1.2) 


• 

(3,1) 


• 

(2,1) 


• 

(1.1) 



l),a2) 
l),/32) 



03 











• • 

(3,3) (2,3) 


^(1.3) 


X 


(3,2) ,t2|2) 

V • 

X3.1) (2,1) 


■ -•- 

(1,2) 

• 

(1,1) 










A ^ 





(a) Invert the component a\ and /3i due to the axial symmetry 
tx = 0. 



a = (q(2, ai) 

/3=(/32,/3i) 



(3,1) (3.2) (3,3) 

• • • 

(2,1) (2.2) (2,3) 



• • • 

(1,1) (1,2) (1,3) 



(3.1) (3,2) ^(^,3) 
(2,1) ,(212) (2,3) 



Xl.l) (1 



2) (1,3) 



(1,2) 



X 



OL2 



(b) Swap the components a\ 02 and /3i /32 due 
to the diagonal symmetry ti = t2- 



Figure 2: The direction of the arrows indicates the growth of the respective multi- index component such 
that the M2L operators K(_2^i) and K(i_2) become the same as K(2,i)- In other words, the mapping from 
the arrows in Fig. [T] to the arrows here is analog to the mapping of the muhi- indices and corresponds to the 
permutations of K(2.i) such that it coincides with K(_2^i) and K(i 2). 



By its means we can identify the subset of transfer vectors Tgym C T C as 

T,y^-rnz3y„^ (24) 

such that ah others T\Tsyni can be expressed as reflections of this fundamental set. Next, we claim that 
these symmetries are also useful for M2L operators. 

Permutation matrices Any reflection of a transfer vector along a symmetry plane determines the per- 
mutation of its associated M2L operator as 

K, = PtK,(,)P7. (25) 

The permutation matrix Pt depends on the transfer vector t Cz T. We also need the surjective mapping 
p : T ^ Tsym] it associates every transfer vector in T to exactly one in Tgym- The left application of Pt, 
essentially, corresponds to the permutation of a and its right application to the permutation of /3, affecting 
rows (respectively columns) of the original matrix ^p{t)- Note, the permutation matrices Pt depend only on 
the transfer vector t. How do we construct them? For some t we introduce axial and diagonal permutations 
nf^ and nf' that read as follows. 

• Axial symmetries: multi-index permutations are computed as 

nf' {ai,a2, as) ^ {ai, 6:2,0:3) with 6^ ^ l'^' if i» _ 0, ^26) 

\i — [ai — 1) else. 

There exist 8 different possibilities that correspond to the octants presented in Fig.|3l Note, TTf{a) = a 
is only true for transfer vectors with ^1,^2,^3 > 0. 
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Figure 3: Three axial symmetry planes split 1? in octants. The reference octant is given by t\,ii-,t-i > 0. 




ei ei ei 

(a) ti = t2 (b) ti = ts (c) t2 = ts 

Figure 4: Three diagonal symmetry planes in the reference octant. 




ei 

Figure 5: The final cone Z^y^ {ti > t2 > ^3 > 0) is obtained by combining axial and diagonal symmetries. 
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• Diagonal symmetries: multi- index permutations are computed as 

7rf (ai,a2,a3) = {ai,aj,ak) such that > \tj\ > \tk\. (27) 

There exist 6 different possibihties that correspond to the 6 different cones if we consider Fig. [S] Note 
again, ■7Tf'{a) = a is only true for transfer vectors with ti > t2 > > 0. 

Given these multi-index permutations and the mapping functions m{a) and n(/3) we can define a permutation 
matrix Pt of size £^ x i^. Its entries are except in column j the entry i — ■m{TTt{m~^{j))) is 1. Let us go 
through the computation of this index: first, we compute the multi-index a — m~^{j), then, we permute the 
multi-index a = TTt{a) and last, we compute the row-index i — m(a). Permutation matrices may be written 
as 

Pt = (em(7rt(m-i(0))): em(7rt(m-i(l)))j ■ ■ • : ^m(TTt(rn-'^ {(£-1^)))) i (28) 

where ej denotes a column unit vector of length £^ with 1 in the jth position and elsewhere. Permutation 
matrices are orthogonal PtPj = I, hence, the inverse exists and can be written as P^^ = Pj'^. Note that the 
combination of permutations is non commutative. Given these permutations and we setup P^ and 
PP and construct the permutation matrix as 

Pt = p? p^ (29) 

The permutation for the multi-index f3 is the same. 

4.2.2 lA with symmetries (lAsym) 

By exploiting the above introduced symmetries we end up with an optimization we refer to as the lAsym 
variant. We individually approximate and store only M2L operators with t e Tsy^ and express all others via 
permutations as shown Eqn. (|25p . The lAsym variant for an arbitrary transfer vector t € T consists of the 
following three steps. 

1. Permute multipole expansions 

wt = P^w (30) 

2. Gompute permuted local expansions 

ft = Kp(t)Wt (31) 

3. Un-permute local expansions 

f = Ptft (32) 

Note that the permutation matrix is not applied to the actual M2L operator (remains unchanged as can be 
seen in step 2). Its application is implemented as a reordering of vector entries (step 1 and 3). Depending 
on whether the M2L operator exist in its full-rank or in its low-rank representation (see Eqn. (HO])) the 
application corresponds to one or two matrix- vector products. In the following we introduce a blocking 
scheme that leads to a faster execution on a computer. 

4.2.3 lAsym with blocking (lAblk) 

We know from Sec. 14.2.11 that based on the consideration of symmetries and permutations, many interac- 
tions share the same M2L operators. This paves the road for blocking schemes. Essentially, the idea is to 
substitute many matrix-vector products by a few matrix-matrix products. Blocking schemes do not change 
the overall complexity of the algorithm, but they allow for the use of highly optimized matrix-matrix prod- 
uct implementations. Such achieve much higher peak performances than optimized matrix-vector product 
implementations due to better cache reuse and less memory traffic [4, ,9] . 

In our concrete case, we use the blocking scheme to block multipole and local expansions. Instead 
of permuting them and applying the M2L operators individually (matrix- vector products), we assemble 
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Algorithm 2 Blocking scheme with |Tsyi„| matrix-matrix products 



function BlockedM2L (target cell X and all far-field interactions ly) 
allocate Fp and Wp for p = 1, . . . , |T|sym 
retrieve f from X 
set all Cp = 

for all source cells Y in ly do 

retrieve w from Y and compute t from cell-pair (X, Y) 

column Cp(t) of '^p{t) gets P^w » Permute multipole expansions 

increment Cp(() 
end for 
for all {Kp} do 

Fp ^ KpWp > Compute permuted local expansions 

end for 
set all Cp = 

for all source cells Y in ly do 
compute t from cell-pair {X, Y) 
retrieve ft from column Cpj-jj of ^p(t) 
increment Cp(t) 

f f + P^fj > Permute permuted local expansions 

end for 
end function 



those that share the same M2L operator as distinct matrices to whom we apply the M2L operators then 
(matrix-matrix products). Algorithm [2] explains this in details. We need the matrices Wp and Fp of size 
(.'^ X Up for p = 1, • . • , |rsyni|- Their columns store the permuted multipole and the resulting (also permuted) 
local expansions. The values for rip indicate how many interactions in T share the same M2L operator of 
interactions in Tgym, in other words, ni + • • • + np -I- • • • -f 'T-|Tay,„| — \T\ is true. In the case of bbPMM this is a 
priori known, since the full interaction list is a priori given (see Sec. l3.2[) . That is not the case for dPMM and 
the values for rip have to be determined during a precomputation step. We also need counters Cp to indicate 
the position of the currently processed expansions in Wp and Fp. As opposed to lAsym, here we split up 
the single loop over all interactions into three loops. In the first one, we assemble the set of matrices Wp. 
At the end Cp < Up is true for all p. In the second loop, we perform at most |Tsy,„| matrix-matrix products. 
And in the last loop, we increment the local expansion with the expansions from all Fp. 

Blocking along multiple target cells Algorithm [5] proposes to use the blocking scheme for all interac- 
tions of only one target cell. In the worst case no M2L operator is shared and the algorithm coincides with 
lAsym. Moreover, the size of the matrices Wp and Fp might vary to a large extent. That is why we pursued 
the blocking idea further to come up with a more efficient scheme. Instead of using individual Up we choose 
it to be a constant n for a\\ p — 1, . . . , iTsymj. Then we keep on blocking expansions using interactions lists 
of multiple (as opposed to one) target cells. Once Cp = n is true for some p, we apply the M2L operator as 
Fp = Kp\Np where Wp, Fp are both of size i'^ x n. In our numerical studies we use this blocking scheme with 
n = 128. 



5 Numerical results 

In the previous sections we introduced various optimizations of the M2L operators for bbPMM and dPMM. 
As representative kernel functions we use 

the Laplace kernel K{x,y) = — ; : and the Helmholtz kernel K(x,y) = — ; (33) 

4:n\x — y\ Att\x — y\ 
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In the numerical studies, hereafter, we use the same parameter setting as are used in the respective pubh- 
cations and for dFMM we use the wavenumber k = 1. All computations are performed on a single CPU of 
a Intel Core i7-2760QM CPU @ 2.40GHz x 8 with 8GB shared memory. We used the compiler gcc 4.6.3 
with the flags "-02 -ffast-math" . The source files (C++) of the program we used can be downloaded via 
http://github.com/burgerbua/dfmm; they are distributed under the "BSD 2-Clause" license. 

M2L optimizations We show and analyze results for the following eight variants: 

1. NA: the full interaction list T is represented by full-rank (not approximated) M2L operators 

2. NAsym: the reduced interaction list Tgym is represented by full-rank (not approximated) M2L operators 

3. NAblk: same as NAsym but with additional blocking of multipole and local expansions 

4. SA: the variant presented in Q and briefly sketched in Sec. 14.11 

5. SArcmp: same as SA but with additional recompression of all Ct 

6. lA: same as NA but with low-rank (individually approximated) M2L operators 

7. lAsym: same as NAsym but with low-rank (individually approximated) M2L operators 

8. lAblk: same as NAblk but with low-rank (individually approximated) M2L operators (see Alg.[21) 

Moreover, we study two different low-rank approximation schemes for the SA and lA variants: on one 
hand we use a truncated singular value decomposition (SVD) and on the other hand the adaptive cross 
approximation (ACA) followed by a truncated SVD [ij . 

Example geometries We use the three benchmark examples shown in Fig. [S] and described in the listing 
below. The depth of the octree is chosen such that near- and far- field are balanced, i.e., we want the fastest 
possible matrix- vector product. 

1. The sphere from Fig. [^alis contained in the bounding-box 64 x 64 x 64; 168 931 particles are randomly 
scattered on its surface. The octree has 6 levels. 

2. The oblate sphere from Fig. [6b] is contained in the bounding-box 6.4 x 64 x 64; 125 931 particles are 
randomly scattered on its surface. The octree has 6 levels. 

3. The prolate sphere from Fig. [Sclis contained in the bounding-box 6.4 x 6.4 x 64; 119 698 particles are 
randomly scattered on its surface. The octree has 7 levels. 

In approximative sense, the sphere is a three-dimensional, the oblate sphere a two-dimensional and the 
prolate sphere a one-dimensional object in M.^. We choose these three geometries to study their influence on 
the performance on the dFMM. Table [5] shows the size of the full and the reduced interaction lists (|r| and 
l^syml) per level (If stands for low- frequency, hf for high-frequency regime) for all three geometries. The size 
of the interaction lists clearly grows with the dimensionality of the geometry. We report on the impact of 
this behavior later. 







sphere 






oblate 


sphere 






prolate 


sphere 






3(hf) 


4(hf) 5(hf) 


6(hf) 


3(hf) 


4(hf) 


5(hf) 


6(hf) 


4(hf) 


5(hf) 


6(hf) 


7(lf) 


\T\ 


668 


18710 2666 


418 


60 


2336 


1502 


400 


214 


738 


382 


424 


\Tsym\ 


20 


518 93 


21 


6 


203 


89 


21 


35 


61 


21 


22 



Table 2: Size of interactions lists per level for dFMM (hf and If stands for high- and low-frequency regime, 
respectively) 
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(a) sphere (b) oblate spheroid (c) prolate spheroid 



Figure 6: The example geometries are centered at (0,0,0) 

5.1 Accuracy of the method 

Both, the bbFMM and the dFMM, have two approximations: 1) the interpolation of the kernel functions 
determined by interpolation order t, and 2) the subsequent low-rank approximation of the M2L operators 
determined by the target accuracy e. The final relative error is a result of both approximations. We compute 
it as ^ 

where M is the number of particles x in an arbitrary reference cluster at the leaf level; / and / are the exact 
and approximate results, respectively. In Fig. [7] we compare achieved accuracies for the bbFMM and the 
dFMM with the lAblk variant (other variants produce identical results). Both plots show the behavior of the 
relative error el^ depending on the interpolation order i and the target accuracy e. Evident is the matching 
result between the left and right figure. All curves show an initial plateau and then, after a sharp knee a 
slope of roughly 1. The knee occurs approximately at {£, e) — {Acc, 10"'*^'^). In the rest of the paper we use 
this convention to describe the accuracy Acc of bbFMM and dFMM. The low-rank approximations for the 
computations whose accuracies are shown in Fig. [7] were conducted with the ACA followed by a truncated 
SVD. By just using a truncated SVD we obtain identical results. 

5.2 Reducing the cost with the SArcmp variant 

The cost of applying an approximate M2L operator mainly depends on its rank k. In Tab. [3] we compare 
the average rank of M2L operators for the SA and the lA variants at all levels that have expansions. Let us 
explain how we computed the average rank for SA. Recall, when we use that variant, all M2L operators from 
one interaction list posses the same rank; the bbFMM and the dFMM in the low-frequency regime have one 
and the dFMM in the high-frequency regime has potentially multiple (directional) interaction lists per level. 
Thus, the average rank per level is the average of all ranks used in all interactions at that level. 

The application of one M2L operator from SA and I A requires 0{k^), respectively 0{2k£^) operations. 
Note, the ranks k of the M2L operators are different; for a given accuracy Acc they are normally significantly 
lower for lA than for SA. This can be seen in Tab. [31 The large ranks at level 7 (first level in the low-frequency 
regime) of SA are noteworthy. There, the lower bound for the separation criterion is given by the usual low- 
frequency criterion saying that non touching cells are admissible. Hence, the smallest possible transfer 
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Figure 7: Accuracies for the prolate sphere from Fig. |5cl The target accuracy Eaca refers to the accuracy 
of the approximate M2L operators (see Eqn. (1^01) V Here, we used the ACA fohowed by a truncated SVD. 







SA 






lA 






Acc 


4(hf) 


5(hf) 6(hf) 


7(lf) 


4(hf) 


5(hf) 


6(hf) 


7(lf) 


3 


9.8 


12.3 12.8 


19 


5.4 


5.7 


5.7 


5.0 


5 


21.7 


30.8 39.2 


71 


11.3 


12.3 


13.5 


12.6 


7 


38.2 


58.6 80.0 


163 


18.9 


21.5 


24.7 


24.1 


9 


57.8 


96.5 138.7 


296 


28.6 


33.2 


39.7 


40.1 



Table 3: Comparison of average ranks k for the SA and lA variants of the dFMM (prolate sphere) 
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vectors have a length of mintgT \t\ = 2. The slowly decaying singular values of associated M2L operators are 
responsible for the large ranks. On the other hand, the upper bound for the separation criterion coincides 
with the lower bound of level 6 (parent level), which is in the high-frequency regime. Hence, the largest 
possible transfer vectors have a length of maxtgT 1^1 ~ 4fc. M2L operators whose transfer vectors are in that 
range have much faster decaying singular values. This fact explains the efficiency of SArcmp (individual 
recompression of each M2L operator). 

In Tab. |4] we analyze the cost savings of SArcmp compared to SA. The left values in each column 
multiplied by 10^ give the overall number of floating point operations per level for SA. The right values (in 
brackets) indicate the respective cost ratio of SArcmp to SA. The recompression reduces the cost remarkably 
(see also Fig. At the low-frequency level 7, SArcmp reduces the cost by more than a factor of 2. This is 
almost twice as much as in high frequency levels. For the impact on timing results we refer to Sec. 15.41 



Acc 



cost(SA)/106 
4(hf) I 5(hf) 



(cost(SArcmp)/ cost(SA)) 

I 6(hf) I 7(lf) 




204.3 (0.62) 

2889.3 (0.47) 

15420.0 (0.40) 

51505.7 (0.38) 



Table 4: Comparison of cost in terms of floating point operations for SA and SArcmp (prolate sphere) 

In Fig. [5] we compare the cost of S A, SArcmp and lA for bbFMM and dFMM for the prolate sphere and 
accuracy Acc = 9. The bbFMM in Fig. |8a]has expansions at the levels 2 — 7. The reason for the jump 
between level 4 and 5 is because the levels 2 — 4 have a maximum of |r| = 16 and the levels 5 — 7 a maximum 
of |r| = 316 (common maximum size for bbFMM) possible M2L operators. The jump in the case of dFMM 
in Fig. [8b]at level 5 can be explained in the same way: if we look at Tab. [2]we see that |r| is about twice as 
large as it is at the other levels. The levels 4, 5, 6 of dFMM are in the high-frequency regime. There, the cost 
of lA is approximately 12, 5, 3 times greater than the cost of SA. However, at level 7 (low-frequency regime) 
of dFMM and at the levels 5 — 7 of bbFMM the cost of lA is only about 2/3 compared to SA. The reason is 
the size of the interaction lists |T|. The larger they become the larger the span between the smallest and the 
largest rank and that favors individual approximations. The SArcmp is computationally the least expensive 
variant. Similar results are obtained for all other accuracies. 



5.3 Speeding up the precomputation with I A variants 

The bottleneck of SA and SArcmp is the relatively large precomputation time. This is a minor issue in the 
case of homogeneous kernel functions. In that case the approximated M2L operators can be stored on the disk 
and loaded if needed for further computations. However, if we deal with non-homogeneous kernel functions, 
such as the Helmholtz kernel, lAsym is the way to go. In Tab. [5] we compare the precomputation time of 
SA, lA and lAsym (we do not report on SArcmp; due to the additional recompression its precomputation 
time is higher than the one for SA). For the low-rank approximation we use a truncated SVD or the AC A 
followed by a truncated SVD. In both cases we end up with the same rank. We get remarkable speedups in 
the precomputation. Let us look at an extreme example: for the sphere and an accuracy Acc — 7 we have 
^SVD = 4740.9 s for SA versus ^aca — 1-8 s for lAsym. This corresponds to a speedup greater than 2600. 
For the oblate and prolate sphere we obtain similar results. 

5.4 Comparison of all M2L variants 

In the previous sections we have revealed the impact of SArcmp and lAsym on the computational cost and 
the precomputation time of the M2L operators, respectively. In this section we compare all variants and 
focus on their impact on memory requirement and running time (of applying M2L operators during the 
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octree level octree level 

(a) bbFMM (b) dFMM 

Figure 8: Comparison of the M2L cost (floating point operations) growth per level for the SA, SArcmp and 
lA variants of bbFMM and dFMM {Acc = 9 and prolate sphere) 



actual matrix- vector product). Tables IBH51 (respectivelv. I MTTj) present running times of the M2L operators 
for bbFMM (respectively, dFMM) and all three geometries. The upper set of rows reports on results obtained 
with a BLAS and LAPACK implementation (libblas 1.2.20110419-2, which is sub-optimal in our case). The 
lower set shows the times obtained with the Intel MKL Q (Intel MKL 10.3.11.339 (intcl64)), which proved 
faster for our purpose. 

Impact of using symmetries on the memory requirement Missing times in the tables indicate that 
the required memory exceeded the available memory of 8 GB. Clearly, the NA variant (no approximation, no 
use of symmetries) requires the most memory. On the other hand, lAsym and lAblk are the most memory 
friendly variants. Both require only memory for storing the low-rank representations of M2L operators from 
Tsym (see Tab. [5] for a comparison of T and Tsym)- 

Impact of the blocking scheme on runtime Bold numbers indicate the fastest variants. Two results 
attract our attention. 1) If we look at the times in the Tabs. I6HTT] we notice that in four cases (bbFMM 
with the sphere, the oblate sphere and the prolate sphere and dFMM with the prolate sphere) lAblk, and 
in two cases (dFMM with the sphere and the oblate sphere) SArcmp is fastest. To be more general, lAblk 
wins at levels having non-directional expansions (all levels of bbFMM and low-frequency levels of dFMM) 
and SArcmp at levels having directional expansions (high-frequency levels of dFMM). Why is that? The 
reason follows from the cost studies in Sec. 15.21 Let us take for example Acc — 5. Recall, we measure the 
computational cost based on the size of the approximate M2L operators, ie., O(fc^) for SA and 0{2k£^) for 
lA. The respective ranks k are given in Tab. [31 The ratio of these costs for SA to lA is 0.46 at the high- 
frequency level 6 and 1.60 at the low-frequency level 7. As a matter of fact, at high-frequency levels wins SA 
and at low-frequency levels lA. Even savings of 47% at the low-frequency level due to the recompression of 
SArcmp (see Tab. |4]) are not sufficient to outweigh the advantage of the blocking scheme of lAblk. If there 
is no low-frequency level, such as for the sphere in Tab. [HI and the oblate sphere in Tab. [TUl the SArcmp 
outperforms all other variants. For example, if we would repeat the computations for the prolate sphere 
with an octree of depth 6 (no low-frequency level) the resulting timing patterns would follow those from 
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SA 


lA 




lAsym 




Acc 


isvD [s] 


iACA [s] 


tsVB [s] ^ACA [s] 


isVD [s] ^ACA [s] 


sphere 


3 


7.0 


4.6 


6.2 


1.9 


0.2 


0.1 


4 


75.0 


20.2 


37.0 


4.9 


1.1 


0.2 


5 


317.2 


69.1 


188.8 


12.4 


5.6 


0.4 


6 


1336.4 


197.0 


790.5 


28.6 


22.9 


0.9 


7 


4740.9 


435.9 






84.0 


1.8 


oblate sphere 


3 


1.4 


0.9 


1.2 


0.4 


0.1 


0.0 


4 


13.4 


3.8 


7.1 


1.1 


0.5 


0.1 


5 


59.7 


13.5 


37.0 


2.7 


2.8 


0.2 


6 


299.4 


39.4 


150.1 


6.4 


11.1 


0.5 


7 


1021.2 


97.6 


551.0 


13.3 


40.9 


0.9 


8 




217.7 


1751.4 


29.0 


129.3 


2.0 


9 




444.6 






369.2 


3.9 


prolate sphere 


3 


0.6 


0.7 


0.8 


0.3 


0.1 


0.0 


4 


7.5 


4.1 


5.3 


0.7 


0.4 


0.1 


5 


64.7 


18.3 


31.0 


2.1 


2.5 


0.1 


6 


358.2 


73,3 


199.4 


5.0 


15.7 


0.4 


7 


1374.8 


204.3 


808.9 


11.6 


66.0 


0.8 


8 




549.3 




25.0 


322.3 


1.7 


9 




1273.2 




50.2 


807.4 


3.7 



Table 5: Precomputation times (SVD versus ACA) for the SA, lA and lAsym variants. Missing numbers 
mean that the available memory of 8 GB has not been sufficient for the respective computation. 

the sphere and the oblate sphere (the overall application time would increase too, since the tree-depth is 
based on our choice of balancing near- and far- field, i.e., the shortest overall application time). 2) Evident 
is the wide margin in the speedup of variants that use blocking and those which do not. If we use the MKL 
(as opposed to libblas) for NA, NAsym, SA, SArcmp, lA and lAsym we end up with 1.5 — 2 times faster 
application times. However, if we use the MKL for NAblk and lAblk we achieve 3 — 4 times faster times. 
Even though these speedups are greatest with the MKL library, it highlights the benefits of the blocking 
scheme presented in Alg. [51 

Varying growth of application times In Fig.[9]we visualize the different growth rates of M2L application 
times for the bbFMM with increasing accuracies Acc. We are interested in the growth rates due to algorithmic 
changes. That is why we only study those variants that do not use blocking. Since no approximation is 
involved the times for NAsym grows the fastest. The times for SA grow slower but still faster than those for 
lAsym. SArcmp features the slowest growth, it is the optimal variant in terms of computational cost (see 
Tab.©. 

6 Conclusion 

The fast multipole method based on Chebyshev interpolation, first presented in [5] for smooth kernel functions 
(bbFMM) and extended in ^] to oscillatory ones (dFMM) , is a convenient-to-use algorithm due to its easy 
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Acc 


NA 


NAsym 


NAblk 


SA 


SArcmp 


lA 


lAsym 


lAblk 


libblas 1.2.20110419-2 


3 


0.7 


0.8 


0.4 


0.5 


0.3 


0.4 


0.4 


0.3 


4 


3.6 


3.8 


1.8 


1.4 


0.9 


1.3 


1.4 


0.7 


5 


14.2 


12.8 


5.9 


4.6 


2.2 


3.4 


3.6 


1.8 


6 


42.8 


38.2 


16.5 


11.5 


4.8 


9.0 


8.7 


4.2 


7 


102.7 


101.7 


40.7 


25.5 


9.6 


20.2 


18.4 


8.6 


8 


229.3 


234.2 


89.7 


47.8 


18.3 


40.7 


35.9 


16.2 


9 




484.5 


180.0 


83.8 


30.0 


74.0 


08. () 


28.4 


Intel MKL 10.3.11.339 (intel64) 


3 


0.3 


0.3 


0.2 


0.2 


0.4 


0.4 


0.4 


0.2 


4 


1.8 


1.2 


0.6 


0.6 


0.7 


0.7 


0.8 


0.4 


5 


9.4 


6.3 


1.9 


2.0 


1.4 


2.0 


2.0 


1.0 


6 


28.2 


14.1 


4.8 


6.9 


2.6 


5.0 


4.0 


1.7 


7 


72.6 


57.7 


12.0 


19.3 


5.8 


12.7 


8.2 


3.5 


8 


127.6 


117.0 


25.6 


34.3 


12.0 


24.6 


16.2 


6.0 


9 




260.5 


50.8 


60.4 


20.6 


44.6 


32.9 


9.9 



Table 6: M2L timings for the bbFMM (sphere). In this table and below as well, bold numbers correspond to 
the smallest entry in a row. In some cases, two columns use bold font when the running times are sufficiently 
close that the difference is not significant. 



Acc 


NA 


NAsym 


NAblk 


SA 


SArcmp 


lA 


lAsym 


lAblk 


libblas 1.2.20110419-2 


3 


0.3 


0.4 


0.2 


0.2 


0.1 


0.2 


0.2 


0.1 


4 


1.6 


1.6 


0.8 


0.8 


0.4 


0.6 


0.6 


0.3 


5 


6.4 


5.8 


2.6 


1.9 


0.9 


1.6 


1.6 


0.8 


6 


19.1 


16.7 


7.4 


5.4 


2.1 


4.0 


3.9 


1.9 


7 


46.8 


44.8 


18.3 


11.2 


4.5 


9.0 


9.0 


3.8 


8 


102.8 


101.9 


40.1 


21.3 


8.4 


18.2 


16.5 


7.2 


9 




212.2 


81.1 


37.0 


13.5 


32.4 


30.4 


12.8 






lut 


'I MKL 10.:-i.ll.;-!;-!9 (lutrKU) 






3 


0.1 


0.1 


0.1 


0.1 


0.2 


0.2 


0.2 


0.1 


4 


0.7 


0.5 


0.3 


0.3 


0.3 


0.3 


0.4 


0.2 


5 


4.5 


2.0 


0.9 


1.0 


0.7 


0.9 


0.9 


0.4 


6 


13.1 


6.7 


2.2 


3.2 


1.2 


2.6 


1.8 


0.8 


7 


33.7 


27.4 


5.4 


8.2 


2.7 


5.7 


4.4 


1.6 


8 


57.9 


52.7 


11.5 


14.6 


5.2 


10.9 


7.2 


2.7 


9 


117.4 


118.2 


22.9 


27.3 


9.1 


20.0 


14.2 


4.4 



Table 7: M2L timings for bbFMM (oblate sphere) 
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Acc 


NA 


NAsym NAblk 


SA 


SArcmp 


lA 


lAsym 


lAblk 


libblas 1.2.20110419-2 


3 


0.2 


0.2 


0.1 


0.1 


0.1 


0.1 


0.1 


0.1 


4 


1.0 


1.1 


0.5 


0.4 


0.3 


0.4 


0.4 


0.2 


5 


4.0 


4.0 


1.9 


1.2 


0.6 


1.2 


1.0 


0.6 


6 


12.0 


11.8 


4.7 


3.4 


1.3 


2.6 


2.4 


1.4 


7 


29.3 


31.4 


11.7 


6.9 


2.8 


7.2 


5.2 


2.9 


8 


68.4 


71.4 


25.5 


13.1 


5.2 


11.6 


10.8 


4.9 


9 


l:-!1.7 


i:-!7.;-i 


50.7 


22.:-! 


8.6 


21.1 


19.9 


8.7 


Intel MKL 10.3.11.339 (intel64) 


3 


0.1 


0.1 


0.1 


0.1 


0.1 


0.1 


0.1 


0.1 


4 


0.4 


0.3 


0.2 


0.2 


0.2 


0.2 


0.2 


0.1 


5 


2.9 


1.3 


0.6 


0.6 


0.4 


0.6 


0.6 


0.3 


6 


8.2 


4.2 


1.4 


2.1 


0.7 


1.5 


1.1 


0.5 


7 


20.5 


16.3 


3.4 


5.9 


1.7 


3.7 


2.4 


1.0 


8 


35.7 


34.1 


7.2 


9.1 


3.7 


7.1 


5.5 


1.7 


9 


73.3 


73.8 


14.4 


17.6 


5.9 


12.8 


11.2 


2.8 



Table 8: M2L timings for bbFMM (prolate sphere) 



Acc 


NA 


NAsym 


NAblk 


SA 


SArcmp 


lA 


lAsym 


lAblk 


libblas 1.2.20110419-2 


3 


6.3 


5.9 


4.8 


2.0 


2.0 


3.2 


2.9 


3.5 


4 


25.7 


25.1 


20.1 


4.3 


3.9 


8.0 


8.1 


8.3 


5 


113.0 


89.5 


71.4 


7.4 


7.1 


19.8 


19.4 


19.7 


6 




275.9 


202.2 


14.5 


11.6 


43.0 


42.8 


40.4 


7 














86.5 


78.1 


Intel MKL 10.3.11.339 (intcl64) 


3 


4.8 


3.9 


2.3 


1.6 


1.7 


3.2 


2.4 


2.2 


4 


21.5 


17.1 


7.5 


3.3 


3.4 


6.7 


5.9 


4.8 


5 


109.1 


61.5 


22.8 


6.3 


6.3 


10.6 


14.0 


10.0 


6 




162.2 


58.8 


11.6 


9.8 


35.1 


29.7 


17.6 


7 














60.8 


30.8 



Table 9: M2L timings for dFMM (sphere); high-frequency leaf level 
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Acc 


NA 


NAsym NAblk 


SA 


SArcmp 


lA 


lAsym 


lAblk 


libblas 1.2.20110419-2 


3 


1.8 


1.8 


1.6 


0.6 


0.6 


0.9 


0.9 


1.1 


4 


8.9 


8.6 


6.8 


1.3 


1.1 


2.7 


2.6 


2.7 


5 


31.4 


30.6 


24.8 


2.8 


2.3 


6.9 


6.8 


6.9 


6 


91.8 


95.9 


71.6 


5.5 


4.1 


15.9 


15.9 


14.8 


7 




228.4 


176.6 


9.6 


7.3 


32.1 


32.3 


28.9 


8 








16.4 


11.2 


59.2 


59.6 


52.7 


9 








25.5 


21.9 




105.1 


90.1 


Intel MKL 10.3.11.339 (intel64) 


3 


1.4 


1.0 


0.7 


0.4 


0.4 


0.7 


0.8 


0.6 


4 


7.2 


5.0 


2.4 


0.9 


1.0 


2.0 


1.8 


1.3 


5 


25.8 


20.7 


7.9 


2.2 


1.9 


5.5 


4.5 


3.1 


6 


62.5 


56.7 


20.4 


4.3 


3.4 


12.6 


10.3 


5.9 


7 




141.3 


48.8 


7.8 


5.5 


24.7 


21.8 


11.0 


8 








13.6 


9.5 


45.3 


42.0 


18.8 


9 








21.0 


14.7 




79.6 


30.9 



Table 10: M2L timings for dFMM (oblate sphere); high-frequency leaf level 



Acc 


NA 


NAsym NAblk 


SA 


SArcmp 


lA 


lAsym 


lAblk 


libblas 1.2.20110419-2 


3 


0.6 


0.6 


0.5 


0.3 


0.2 


0.3 


0.3 


0.3 


4 


3.2 


2.8 


2.3 


1.0 


0.5 


0.9 


0.9 


1.0 


5 


11.9 


10.0 


8.7 


2.7 


1.2 


2.4 


2.4 


2.5 


6 


32.5 


30.8 


25.1 


6.5 


2.6 


6.0 


5.6 


5.7 


7 


80.3 


77.0 


61.1 


13.2 


5.1 


12.1 


11.8 


11.1 


8 








24.8 


9.1 


24.3 


23.3 


21.2 


9 








47.5 


18.6 


43.5 


43.5 


37.2 


Intel MKL 10.3.11.339 (intel64) 


3 


0.3 


0.3 


0.2 


0.2 


0.2 


0.2 


0.2 


0.2 


4 


2.1 


1.4 


0.7 


0.5 


0.4 


0.6 


0.6 


0.4 


5 


8.7 


5.8 


2.5 


2.0 


1.0 


2.0 


1.5 


1.0 


6 


20.2 


17.9 


6.7 


5.1 


2.0 


4.6 


3.4 


2.0 


7 


48.4 


46.5 


16.3 


10.3 


4.0 


9.6 


7.5 


3.8 


8 








16.6 


7.2 


18.3 


15.8 


7.0 


9 








31.6 


14.8 


33.7 


32.1 


11.6 



Table 11: M2L timings for dFMM (prolate sphere); low- frequency leaf level 
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implementation and its kernel function independence. In this paper we investigated algorithms to reduce the 
running time of the M2L operator. We proposed several optimizations and studied their respective strengths 
and weaknesses. 

On one hand we proposed SArcmp, which uses an individual recompression of the suboptimally approx- 
imated M2L operators obtained via SA (the variant presented in [5]). We have shown that this new variant 
reduces the computational cost noticeably. In some settings it even provides the fastest M2L application 
times. On the other hand we also proposed a new set of optimizations based on an individual low-rank 
approximation of the M2L operators; we refer to them as lA variants. As opposed to SA they directly lead 
to the optimal low-rank representation for each operator. The overall number of flops is greater than for 
SArcmp (which is strictly a lower bound on the number of flops). However, the advantage of the individual 
treatment of the M2L operators is that we can exploit symmetries in their arrangement. This means that 
all operators can be expressed as permutations of a subset. For example, in the case of the bbFMM (in 
which the full interaction list has a constant size), we need to approximate and store only 16 instead of 
316 operators. The remaining ones can be expressed as permutations thereof. This has a great impact on 
the precomputation time and the memory requirement. Moreover it allows to express (again in the case 
of the bbFMM) the at most 189 matrix-vector products (applications of the M2L operators) as at most 16 
matrix-matrix products. We referred to this approach as the lAblk variant. It can then take advantage of 
highly optimized implementations of matrix-matrix operations (e.g., the MKL Q). 

Let us conclude by comparing SArcmp and lAblk, the two variants that have the fastest running times. 
lAblk wins if we compare precomputation time, required memory and runtime at levels having non-directional 
expansions (bbFMM and low- frequency levels in dFMM). SArcmp wins only if we compare the runtime at 
levels having directional expansions (high-frequency levels in dFMM). However, in order to identify the 
optimal variant we have to distinguish two potential uses of the FMM as a numerical scheme to perform 
fast matrix-vector products. 1) If we are interested in the result of a single matrix-vector product, a quick 
precomputation is essential. However, 2) if we are looking for the iterative solution of an equation system 
(many matrix- vector products), a fast running time of the M2L operator is crucial. Let us explain this 
with an example. We take dFMM (with MKL) with accuracy Acc — 5 for the sphere. lAblk wins if we 
are interested in the former use. The precomputation takes 0.4 s versus 69.1s (for SArcmp) and the M2L 
application takes 10.0 s versus 6.3 s, which sums up to 10.4 s versus 75.4 s. All other operators (P2P, P2M, 
M2M, L2L and L2P) have nearly the same runtime in both cases, and their runtimes are negligible compared 
to M2L. Looking at the latter use, SArcmp starts being faster if the iterative solution requires more than 19 
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matrix-vector products. For higher accuracies this threshold rises, e.g., for Acc = 6 it hes at 26 matrix-vector 
products. In the case of bbFMM, lAblk is always optimal. 
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